[webgpu] Optimize Conv by im2col-matmul #26603
Conversation
[Performance results table: Lunar Lake, sd-turbo]
```cpp
const uint32_t kernel_height = onnxruntime::narrow<uint32_t>(kernel_shape[2]);
const uint32_t kernel_width = onnxruntime::narrow<uint32_t>(kernel_shape[3]);

TensorShape nhwc_kernel_shape{channel_output, kernel_height, kernel_width, channel_input};
```
Suggested change:
```diff
- TensorShape nhwc_kernel_shape{channel_output, kernel_height, kernel_width, channel_input};
+ TensorShape ohwi_kernel_shape{channel_output, kernel_height, kernel_width, channel_input};
```
Done
```cpp
const uint32_t kernel_width = onnxruntime::narrow<uint32_t>(kernel_shape[3]);

TensorShape nhwc_kernel_shape{channel_output, kernel_height, kernel_width, channel_input};
Tensor nhwc_kernel = context.CreateGPUTensor(kernel->DataType(), nhwc_kernel_shape);
```
Suggested change:
```diff
- Tensor nhwc_kernel = context.CreateGPUTensor(kernel->DataType(), nhwc_kernel_shape);
+ Tensor ohwi_kernel = context.CreateGPUTensor(kernel->DataType(), nhwc_kernel_shape);
```
```cpp
const uint32_t M_tiles = ceil_div(im2col_m, tile_m);
const uint32_t N_tiles = ceil_div(im2col_n, tile_n);
im2col_mm_program.SetDispatchGroupSize(M_tiles, N_tiles, batch);
```
How about enhancing the current TransposeProgram with the shared path instead of adding a new one?
You are transposing from perm [0, 1, 2, 3] to perm [0, 2, 3, 1]. That is equivalent to transposing [o, i, hw] to [o, hw, i]. You could extend DoTranspose's shared path to support any shape where only the last two dimensions are swapped and the preceding dimensions stay unchanged. Currently, the shared path only supports a 2-D transpose from perm [0, 1] to perm [1, 0]. We can extend it to transpose from [0, 1, 2] to [0, 2, 1]: whenever a transpose only swaps the last two dimensions, first reshape the tensor into a 3-D tensor [d0 * d1 * ... * dn-3, dn-2, dn-1].
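For illustration, here is a minimal host-side sketch of that check and collapse. The function name and signature are hypothetical, not existing ORT helpers; for the OIHW case above, the caller would first view the tensor as [o, i, h*w] with perm [0, 2, 1].

```cpp
#include <cstdint>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical helper: returns true if `perm` keeps all leading dimensions in
// place and swaps only the last two, and if so collapses the leading
// dimensions so the transpose can run as a 3-D [batch, rows, cols] transpose
// on the shared path.
bool CollapseToLastTwoDimTranspose(const std::vector<int64_t>& dims,
                                   const std::vector<size_t>& perm,
                                   std::vector<int64_t>& collapsed_out) {
  const size_t rank = dims.size();
  if (rank < 2 || perm.size() != rank) return false;
  // Leading dims unchanged: perm must be [0, 1, ..., n-3, n-1, n-2].
  for (size_t i = 0; i + 2 < rank; ++i) {
    if (perm[i] != i) return false;
  }
  if (perm[rank - 2] != rank - 1 || perm[rank - 1] != rank - 2) return false;
  // batch = d0 * d1 * ... * d(n-3); for rank 2 this is simply 1.
  const int64_t batch = std::accumulate(dims.begin(), dims.end() - 2,
                                        int64_t{1}, std::multiplies<int64_t>());
  collapsed_out = {batch, dims[rank - 2], dims[rank - 1]};  // [batch, rows, cols]
  return true;
}
```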
Understood.
I intend to improve the current Transpose path, as discussed in the previous PR #26501.
Could I handle this as a separate task in a follow-up PR?
```wgsl
for (var inner_k_idx = 0u; inner_k_idx < TILE_K_VEC_SIZE; inner_k_idx++) {
  let weight_data = weight_tile[inner_k_idx][local_idx];
#if use_subgroup
  let src_data = src_tile[inner_k_idx][sg_id];
```
What if sg_size is larger or smaller than TILE_M_SIZE?
Currently, Lunar Lake devices support a subgroup size of 32.
Support for devices with other subgroup sizes must be added carefully, passing that value as a parameter to the shader template.
BTW, using subgroupShuffle improves performance by 5% to 10% compared with not using subgroupShuffle.
If you only want to support sg_size = 32, I would prefer you write it like below. Otherwise, you can't guarantee that sg_size is 32 unless you check the supported subgroup range and confirm subgroupMinSize = subgroupMaxSize = 32.

```wgsl
if (sg_size == 32) {
  let src_data = src_tile[inner_k_idx][sg_id];
  for (var m_idx = 0u; m_idx < TILE_M_SIZE; m_idx++) {
    results[m_idx] += output_element_t(dot(weight_data, subgroupShuffle(src_data, m_idx)));
  }
} else {
  for (var m_idx = 0u; m_idx < TILE_M_SIZE; m_idx++) {
    results[m_idx] += output_element_t(dot(weight_data, src_tile[inner_k_idx][m_idx]));
  }
}
```
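One way to satisfy that subgroup-range check is on the host, selecting the subgroup shader variant only when the size is guaranteed. A hedged C++ sketch, assuming Dawn's wgpu::AdapterInfo exposes subgroupMinSize/subgroupMaxSize as in the current WebGPU spec (field names may vary by Dawn version):

```cpp
#include <webgpu/webgpu_cpp.h>

// Hedged sketch: decide at pipeline-creation time whether the sg_size == 32
// fast path is safe, instead of branching per-invocation in the shader.
bool AdapterGuaranteesSubgroupSize32(const wgpu::Adapter& adapter) {
  wgpu::AdapterInfo info{};
  adapter.GetInfo(&info);
  // Only when the supported range is exactly [32, 32] is sg_size known to be 32.
  return info.subgroupMinSize == 32 && info.subgroupMaxSize == 32;
}
```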
My understanding is that most modern GPUs support a subgroup size of 32 or greater, which is what this shader is designed for.
The code is written this way to avoid the potential performance penalty of a runtime conditional check inside the shader:

```wgsl
if (sg_size == 32) {
  // do something
}
```

Testing on Lunar Lake showed this penalty to be high.
Add a comment in the code about it. Or I can simply make use_subgroup = false by default.
Description
This PR optimizes the Conv operation by implementing two new compute shaders: oihw_to_ohwi and im2col-matmul.
- oihw_to_ohwi: Improves performance over the default Transpose shader by using workgroup memory to ensure contiguous memory read/write patterns.
- im2col-matmul: Testing on Lunar Lake demonstrated up to an 87% performance improvement in Conv_2D operations.
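To make the decomposition concrete, here is a minimal CPU reference of the im2col step that the im2col-matmul shader performs on the GPU. This is an illustrative sketch only, not the actual WGSL implementation; it assumes NHWC input, unit dilation, and symmetric stride/padding for brevity.

```cpp
#include <cstdint>
#include <vector>

// Builds the [out_h * out_w, kH * kW * C_in] im2col matrix for one NHWC image.
// The convolution then reduces to a GEMM against the OHWI weights viewed as a
// [C_out, kH * kW * C_in] matrix.
std::vector<float> Im2Col(const std::vector<float>& input,  // [H, W, C_in]
                          int64_t H, int64_t W, int64_t C_in,
                          int64_t kH, int64_t kW,
                          int64_t stride, int64_t pad) {
  const int64_t out_h = (H + 2 * pad - kH) / stride + 1;
  const int64_t out_w = (W + 2 * pad - kW) / stride + 1;
  const int64_t K = kH * kW * C_in;
  std::vector<float> cols(out_h * out_w * K, 0.0f);  // zeros cover the padding
  for (int64_t oh = 0; oh < out_h; ++oh) {
    for (int64_t ow = 0; ow < out_w; ++ow) {
      for (int64_t kh = 0; kh < kH; ++kh) {
        for (int64_t kw = 0; kw < kW; ++kw) {
          const int64_t ih = oh * stride + kh - pad;
          const int64_t iw = ow * stride + kw - pad;
          if (ih < 0 || ih >= H || iw < 0 || iw >= W) continue;  // in padding
          for (int64_t c = 0; c < C_in; ++c) {
            cols[(oh * out_w + ow) * K + (kh * kW + kw) * C_in + c] =
                input[(ih * W + iw) * C_in + c];
          }
        }
      }
    }
  }
  return cols;  // output[m][o] = dot(cols[m], weight_ohwi[o]), m in [0, out_h*out_w)
}
```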
Motivation and Context
See above.